Application of Lempel-Ziv Factorization to the Approximation of Grammar-Based Compression
Abstract
We present an almost-linear-time (O(n · log |Σ|)) O(log n)-ratio approximation of the minimal grammar-based compression of a given string of length n over an alphabet Σ, and an O(k · log n)-time transformation of an LZ77 encoding of size k into a grammar-based encoding of size O(k · log n). Computing the exact size of the minimal grammar-based compression is known to be NP-complete. The basic novel tool is the AVL-grammar.
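To make the two encodings in the abstract concrete, here is a minimal Python sketch of the objects involved. It is not the paper's algorithm: the factorization below is a naive greedy LZ77 variant without self-overlapping copies (far slower than the bounds stated above), and the grammar is a hand-built straight-line program rather than an AVL-grammar. The names `lz77_factorize`, `lz77_decode`, `expand_slp`, and `EXAMPLE_RULES` are hypothetical and chosen only for illustration.

```python
def lz77_factorize(text):
    """Naive greedy LZ77-style factorization (illustrative, not efficient).

    Each factor is either ("lit", ch) for a character with no earlier
    occurrence, or ("copy", pos, length) for the longest prefix of the
    remaining text that occurs entirely inside the already-parsed part.
    """
    factors = []
    i, n = 0, len(text)
    while i < n:
        best_pos, best_len = -1, 0
        length = 1
        while i + length <= n:
            # str.find(sub, 0, i) only matches occurrences lying within text[:i]
            pos = text.find(text[i:i + length], 0, i)
            if pos == -1:
                break
            best_pos, best_len = pos, length
            length += 1
        if best_len == 0:
            factors.append(("lit", text[i]))
            i += 1
        else:
            factors.append(("copy", best_pos, best_len))
            i += best_len
    return factors


def lz77_decode(factors):
    """Rebuild the text from the factors produced by lz77_factorize."""
    out = []
    for factor in factors:
        if factor[0] == "lit":
            out.append(factor[1])
        else:
            _, pos, length = factor
            out.extend(out[pos:pos + length])
    return "".join(out)


def expand_slp(rules, symbol):
    """Expand a straight-line grammar: each rule is a terminal string
    or a pair (left, right) of previously defined symbols."""
    rhs = rules[symbol]
    if isinstance(rhs, str):
        return rhs
    left, right = rhs
    return expand_slp(rules, left) + expand_slp(rules, right)


# Hand-built grammar (an assumption for illustration) deriving "abababab".
EXAMPLE_RULES = {
    "A": "a",
    "B": "b",
    "X1": ("A", "B"),
    "X2": ("X1", "X1"),
    "X3": ("X2", "X2"),
}
```

For the string "abababab", this factorization yields four factors and the example grammar has five rules, which gives a feel for how the encoding size k and the grammar size relate on a highly repetitive input. The balancing that guarantees the O(k · log n) grammar size is the role the abstract assigns to AVL-grammars and is not attempted here.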
Similar resources
CHICO: A Compressed Hybrid Index for Repetitive Collections
Indexing text collections to support pattern matching queries is a fundamental problem in computer science. New challenges keep arising as databases grow, and for repetitive collections, compressed indexes become relevant. To successfully exploit the regularities of repetitive collections different approaches have been proposed. Some of these are Compressed Suffix Array, Lempel-Ziv, and Grammar...
On the Approximation Ratio of Lempel-Ziv Parsing
Shannon’s entropy is a clear lower bound for statistical compression. The situation is not so well understood for dictionary-based compression. A plausible lower bound is b, the least number of phrases of a general bidirectional parse of a text, where phrases can be copied from anywhere else in the text. Since computing b is NP-complete, a popular gold standard is z, the number of phrases in th...
Lempel-Ziv factorization: Simple, fast, practical
For decades the Lempel-Ziv (LZ77) factorization has been a cornerstone of data compression and string processing algorithms, and uses for it are still being uncovered. For example, LZ77 is central to several recent text indexing data structures designed to search highly repetitive collections. However, in many applications computation of the factorization remains a bottleneck in practice. In th...
Lempel-Ziv Factorization Using Less Time & Space
For 30 years the Lempel-Ziv factorization LZ_x of a string x = x[1..n] has been a fundamental data structure of string processing, especially valuable for string compression and for computing all the repetitions (runs) in x. Traditionally the standard method for computing LZ_x was based on Θ(n)-time (or, depending on the measure used, O(n log n)-time) processing of the suffix tree ST_x of x. Recen...
Universal Compressed Text Indexing
The rise of repetitive datasets has lately generated a lot of interest in compressed self-indexes based on dictionary compression, a rich and heterogeneous family that exploits text repetitions in different ways. For each such compression scheme, several different indexing solutions have been proposed in the last two decades. To date, the fastest indexes for repetitive texts are based on the ru...
Publication date: 2002